Cross-modal Identification of Talkers

Author

  • Lorin Lachs
Abstract

Recent evidence from experiments using sinewave speech shows that the linguistic content of a message, as well as the indexical characteristics of the talker, can be perceived from the isolated kinematic form of speech utterances. Similarly, isolated visual kinematic information in the form of point-light displays has been shown to behave in much the same way that full visual displays of an articulating talker do (e.g., by enhancing intelligibility in noise). If the isolated kinematic visual form of speech is informative in speech perception, and the isolated kinematic acoustic form of speech can carry indexical information, then visual information should also be able to carry information about the indexical properties of the talker. If so, perceivers should be able to take the information about an utterance obtained through one sensory modality (e.g., vision) and use it to identify the same utterance in the other sensory modality (e.g., audition). The present study examined the ability of participants to perceive and use either auditory or visual information about articulation across sensory modalities in identifying source characteristics of a talker's voice.

Optical information about articulation has been shown to have substantial effects on speech perception (Massaro & Cohen, 1995). In the absence of auditory stimulation, visual information is sufficient for accurate speech perception (Bernstein, Demorest, & Tucker, in press). In conjunction with auditory information, visual information can enhance speech intelligibility in noise by an amount equivalent to a +15 dB gain in signal-to-noise ratio (Sumby & Pollack, 1954). Alternatively, incongruent information in the auditory and visual aspects of multimodal stimuli can interact to form illusory percepts (the “McGurk” effect; McGurk & MacDonald, 1976).

Because visual information about articulation can have such profound effects on speech perception, some theorists have proposed that the perceptually useful information in speech signals must be transmittable via acoustic as well as optic media. Indeed, some researchers have gone so far as to propose that the information is amodal; that is, the information for speech is not constrained to any particular sensory modality. In fact, the McGurk effect has been replicated using auditory and tactile information about speech (Fowler & Dekle, 1991), demonstrating that some degree of useful information about speech can be obtained through sensory modalities other than audition. In another experiment designed to demonstrate that speech information can be carried in multiple sensory modalities, Green and Kuhl (1989) showed that the perceived VOT boundary for a synthetic /bi/-/pi/ continuum shifted toward the VOT boundary for a /di/-/ti/ continuum when the stimuli were paired with the visual specification of a talker uttering the syllable /gi/. That is, the VOT boundary shifted in a manner appropriate to the illusory percept invoked by the McGurk illusion. In an earlier study, Green and Miller (1985) showed that speaking-rate information in optical displays influenced the identification of voiced or voiceless segments on an acoustic continuum that remained constant. These studies demonstrate that the dynamic aspects of visual information play a role in the perception of speech, and that the same auditory information can be perceived differently depending on the kinds of visual information available during perception.
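To make the boundary-shift logic concrete, the sketch below shows how a VOT category boundary is typically estimated: fit a logistic psychometric function to the proportion of voiceless responses along the continuum and take its 50% crossover point. The identification data here are synthetic, not values from Green and Kuhl's study; a visually induced shift of the kind they reported would appear as a change in the fitted midpoint when the audiovisual data are refit.

```python
# A minimal sketch (synthetic data) of VOT category-boundary estimation:
# fit a logistic psychometric function to identification proportions and
# report the 50% crossover point as the boundary.
import numpy as np
from scipy.optimize import curve_fit

def logistic(vot, boundary, slope):
    """Proportion of 'voiceless' (/pi/) responses as a function of VOT (ms)."""
    return 1.0 / (1.0 + np.exp(-slope * (vot - boundary)))

vot_ms = np.arange(0, 61, 10)  # a hypothetical 7-step /bi/-/pi/ continuum
p_voiceless = np.array([0.02, 0.05, 0.20, 0.65, 0.90, 0.97, 0.99])

(boundary, slope), _ = curve_fit(logistic, vot_ms, p_voiceless, p0=[30.0, 0.2])
print(f"Estimated VOT boundary: {boundary:.1f} ms")
# Pairing the same acoustic continuum with a visual /gi/ and refitting would
# reveal a McGurk-style boundary shift as a change in the fitted midpoint.
```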
The exact form of such information remains the subject of some debate, but a growing body of research points to the possibility that the acoustic or optical forms of speech signals carry kinematic or dynamic information about the articulation of the vocal tract, and that such information drives the perception of linguistically relevant utterances (Fowler, 1986; Fowler & Rosenblum, 1991; Liberman & Mattingly, 1985; Rosenblum & Saldaña, 1996; Summerfield, 1987). The use of dynamic information has been demonstrated across multiple contexts. For example, Green and Gerdeman (1995) showed that cross-modal discrepancies in the vowel portion of McGurk stimuli influenced the degree to which the consonant portion was susceptible to the McGurk effect. These findings suggest that the perceptual system must be sensitive to non-segmental, coarticulatory information when it attempts to make sense of multimodal inputs.

Another method used to study the problem of audiovisual integration in speech perception is the point-light technique (Johansson, 1973). By placing small reflective patches at key positions on a talker's face and darkening everything else in the display, one can isolate the kinematic aspects of visual displays of talkers articulating speech (Rosenblum, Johnson, & Saldaña, 1996; Rosenblum & Saldaña, 1996). Such “kinematic primitives” have been shown to behave much like unmodified, full visual displays of speech (Rosenblum & Saldaña, 1996). For example, the McGurk illusion can be induced by dubbing visual point-light displays onto phonetically discrepant auditory syllables (Rosenblum & Saldaña, 1996). In addition, an extension of Sumby and Pollack's (1954) findings has demonstrated that providing point-light information about articulation in conjunction with auditory speech embedded in noise can increase intelligibility (Rosenblum et al., 1996).

All of the studies reviewed above, and indeed most previous investigations of the effects of audiovisual information on speech perception, have focused on what are commonly referred to as the linguistic aspects of the signal: phoneme or syllable identification and spoken word recognition. However, a growing body of literature has shown that speech signals also carry information about the indexical properties of the talker, and that this information is perceived, stored in memory, and used during speech perception and spoken word recognition (see Goldinger, 1998; Pisoni, 1997, for reviews). Numerous recent studies have shown that the indexical properties of a talker's voice are stored in long-term memory (Bradlow, Nygaard, & Pisoni, 1999; Goldinger, Pisoni, & Logan, 1991; Martin, Mullennix, Pisoni, & Summers, 1989). For example, using a continuous recognition task, Palmeri, Goldinger, and Pisoni (1993) showed that repeating a word in the same voice that produced it during study facilitated later recognition of that word. Furthermore, the size of this effect did not change with the number of talkers uttering test items, suggesting that the encoding of voice attributes in memory is automatic and not controlled by strategic processes.

The link between memorial encoding of fine-grained details of spoken words and perceptual processes has also been established (Nygaard, Sommers, & Pisoni, 1994, 1995). In one experiment, Nygaard and Pisoni (1998) trained participants to identify a set of novel talkers from their voices alone.
Once the participants had learned the voices using a set of training stimuli, Nygaard and Pisoni found that the knowledge of talker characteristics they had acquired generalized to new stimuli. Furthermore, the perceptual learning of the trained voices transferred to a novel task: words spoken by familiar voices were recognized more accurately in noise than words spoken by unfamiliar voices.

But what kind of information about a talker is contained in speech, and how does that information contribute to speech perception? In an examination of the acoustic correlates of talker intelligibility, Bradlow, Torretta, and Pisoni (1996) showed that while global characteristics such as fundamental frequency and speaking rate had little effect on intelligibility, acoustic-phonetic properties of the voice, such as vowel space reduction and “articulatory precision,” were strong indicators of overall intelligibility. These findings suggest that the indexical properties of a talker may be completely intermixed with the...
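The vowel space measure mentioned above can be made concrete. A common operationalization is the area of the quadrilateral formed by a talker's corner vowels in F1-F2 space; the sketch below computes it with the shoelace formula. The formant values are hypothetical illustrations, not measurements from Bradlow, Torretta, and Pisoni (1996).

```python
# A minimal sketch of a vowel space area measure: the area of the polygon
# formed by the corner vowels in F1-F2 space, via the shoelace formula.
# Formant values below are illustrative, not data from the cited study.

def polygon_area(points):
    """Shoelace formula for the area of a simple polygon [(x, y), ...]."""
    n = len(points)
    s = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

# Hypothetical mean (F1, F2) values in Hz for corner vowels /i/, /ae/, /a/, /u/.
corner_vowels = [(280, 2250), (660, 1720), (710, 1100), (310, 870)]
print(f"Vowel space area: {polygon_area(corner_vowels):.0f} Hz^2")
# A talker with a reduced vowel space (corner vowels pulled toward the center)
# yields a smaller area, which Bradlow et al. linked to lower intelligibility.
```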


Similar articles

Cross-modal signatures in maternal speech and singing

We explored the possibility of a unique cross-modal signature in maternal speech and singing that enables adults and infants to link unfamiliar speaking or singing voices with subsequently viewed silent videos of the talkers or singers. In Experiment 1, adults listened to 30-s excerpts of speech followed by successively presented 7-s silent video clips, one from the previously heard speaker (di...

When is indexical information about speech activated? evidence from a cross-modal priming experiment

Listeners were asked to judge talkers' sex from audio samples. Pictures of men, women, or a neutral visual stimulus were presented concurrent with, 150 ms before, or 150 ms after the spoken stimulus. Listeners' identification of sex for men's voices was most strongly affected by the visual stimulus when it was presented 150 ms after the stimulus. Voice-picture mismatches affected recognition of...

Experimental and finite-element free vibration analysis and artificial neural network based on multi-crack diagnosis of non-uniform cross-section beam

Crack identification is a very important issue in mechanical systems, because it is a form of damage that, if it develops, may cause catastrophic failure. In the first part of this research, modal analysis of a multi-cracked variable cross-section beam is done using the finite element method. Then, the obtained results are validated using the results of experimental modal analysis tests. In the next part, a nove...

Open-set identification of non-native talkers' language backgrounds

Listeners are skilled at detecting native talkers of a language, but can they identify specific non-native language backgrounds? Open-set identification was used to explore this question. Eighty monolingual American English-speaking listeners labeled the language backgrounds of 30 talkers with 5 different native languages (L1s) on the basis of syllable- and word-length samples of English. As expe...

Is Consonant Perception Linked to Within-Category Dispersion or Across-Category Distance?

This study investigated the relation between the internal structure of phonetic categories and consonant intelligibility. For two phonetic contrasts (/s/-/ʃ/ and /b/-/p/), 32 iterations per category were elicited for each of 40 talkers from the same accent group and age range, and measures of cross-category distance and within-category dispersion were obtained. These measures varied substantially...

Cross-language Talker Identification

Two groups of monolingual, native English-speaking listeners were trained to identify the voices of ten German-English bilingual talkers. One group of listeners learned to identify the voices from English stimuli only, while the other group learned to identify the talkers from German stimuli only. After four days of training, both groups of listeners were asked to identify the same talkers from...



Journal:

Volume:   Issue:

Pages:  -

Publication date: 2000